Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche (Performance analysis of sub-word language modeling for under-resourced languages with rich morphology: case study on Swahili and Amharic) [in French]

نویسندگان

  • Hadrien Gelas
  • Solomon Teferra Abate
  • Laurent Besacier
  • François Pellegrino
چکیده

Performance analysis of sub-word language modeling for under-resourced languages with rich morphology : case study on Swahili and Amharic This paper investigates the impact on ASR performance of sub-word units for two underresourced african languages with rich morphology (Amharic and Swahili). Two subword units are considered : syllable and morpheme, the latter being obtained in a supervised or unsupervised way. The important issue of word reconstruction from the syllable (or morpheme) ASR output is also discussed. For both languages, best results are reached with morphemes got from unsupervised approach. It leads to very significant WER reduction for Amharic ASR for which LM training data is very small (2.3M words) and it also slightly reduces WER over a Word-LM baseline for Swahili ASR (28M words for LM training). A detailed analysis of the OOV word reconstruction is also presented ; it is shown that a high percentage (up to 75% for Amharic) of OOV words can be recovered with morph-based language model and appropriate reconstruction method. MOTS-CLÉS : Modèle de langage, Morphème, Hors vocabulaire, Langues peu-dotées.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Analyse des performances de modèles de langage sub-lexicale pour des langues peu-dotées à morphologie riche

Performance analysis of sub-word language modeling for under-resourced languages with rich morphology : case study on Swahili and Amharic This paper investigates the impact on ASR performance of sub-word units for two underresourced african languages with rich morphology (Amharic and Swahili). Two subword units are considered : syllable and morpheme, the latter being obtained in a supervised or...

متن کامل

A State of the Art of Word Sense Induction: A Way Towards Word Sense Disambiguation for Under-Resourced Languages

______________________________________________________________________________________________ Word Sense Disambiguation (WSD), the process of automatically identifying the meaning of a polysemous word in a sentence, is a fundamental task in Natural Language Processing (NLP). Progress in this approach to WSD opens up many promising developments in the field of NLP and its applications. Indeed, ...

متن کامل

External Lexical Information for Multilingual Part-of-Speech Tagging

Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performances of four systems on datasets covering 16 languages, two of these systems being feature-based (MEMMs and CRFs) and two of them being neural-based (bi-LSTMs). We show that, on average, all four approaches perform similar...

متن کامل

Is Bad Structure Better Than No Structure?: Unsupervised Parsing for Realisation Ranking

In natural language generation using symbolic grammars, state-of-the-art realisation rankers use statistical models incorporating both language model and structural features. The rankers depend on multiple structures produced by the particular large-scale symbolic grammars to rank the output; for languages with smaller resources and in-development grammars, we look at the feasibility of an alte...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012